This dataset comes from Prosper and contains data for 113,937 loans originated from late 2005 to 2014. The main goal is to determine which variables set the interest rate paid by the borrowers (BorrowerRate).
To start the analysis, it is important to know the characteristics of the dataset. The variable names, classes and summaries follow.
## 'data.frame': 113937 obs. of 10 variables:
## $ BorrowerRate : num 0.158 0.092 0.275 0.0974 0.2085 ...
## $ EstimatedReturn : num NA 0.0547 NA 0.06 0.0907 ...
## $ ProsperRating..numeric.: int NA 6 NA 6 3 5 2 4 7 7 ...
## $ EmploymentStatus : Factor w/ 9 levels "","Employed",..: 9 2 4 2 2 2 2 2 2 2 ...
## $ CreditScoreRangeLower : int 640 680 480 800 680 740 680 700 820 820 ...
## $ CreditScoreRangeUpper : int 659 699 499 819 699 759 699 719 839 839 ...
## $ DebtToIncomeRatio : num 0.17 0.18 0.06 0.15 0.26 0.36 0.27 0.24 0.25 0.25 ...
## $ IncomeRange : Factor w/ 8 levels "$0","$1-24,999",..: 4 5 7 4 3 3 4 4 4 4 ...
## $ LoanOriginalAmount : int 9425 10000 3001 10000 15000 15000 3000 10000 10000 10000 ...
## $ LoanOriginationDate : Factor w/ 1873 levels "2005-11-15 00:00:00",..: 426 1866 260 1535 1757 1821 1649 1666 1813 1813 ...
## BorrowerRate EstimatedReturn ProsperRating..numeric.
## Min. :0.0000 Min. :-0.183 Min. :1.000
## 1st Qu.:0.1340 1st Qu.: 0.074 1st Qu.:3.000
## Median :0.1840 Median : 0.092 Median :4.000
## Mean :0.1928 Mean : 0.096 Mean :4.072
## 3rd Qu.:0.2500 3rd Qu.: 0.117 3rd Qu.:5.000
## Max. :0.4975 Max. : 0.284 Max. :7.000
## NA's :29084 NA's :29084
## EmploymentStatus CreditScoreRangeLower CreditScoreRangeUpper
## Employed :67322 Min. : 0.0 Min. : 19.0
## Full-time :26355 1st Qu.:660.0 1st Qu.:679.0
## Self-employed: 6134 Median :680.0 Median :699.0
## Not available: 5347 Mean :685.6 Mean :704.6
## Other : 3806 3rd Qu.:720.0 3rd Qu.:739.0
## : 2255 Max. :880.0 Max. :899.0
## (Other) : 2718 NA's :591 NA's :591
## DebtToIncomeRatio IncomeRange LoanOriginalAmount
## Min. : 0.000 $25,000-49,999:32192 Min. : 1000
## 1st Qu.: 0.140 $50,000-74,999:31050 1st Qu.: 4000
## Median : 0.220 $100,000+ :17337 Median : 6500
## Mean : 0.276 $75,000-99,999:16916 Mean : 8337
## 3rd Qu.: 0.320 Not displayed : 7741 3rd Qu.:12000
## Max. :10.010 $1-24,999 : 7274 Max. :35000
## NA's :8554 (Other) : 1427
## LoanOriginationDate
## 2014-01-22 00:00:00: 491
## 2013-11-13 00:00:00: 490
## 2014-02-19 00:00:00: 439
## 2013-10-16 00:00:00: 434
## 2014-01-28 00:00:00: 339
## 2013-09-24 00:00:00: 316
## (Other) :111428
Some variables need to be transformed. First, the Prosper rating is stored as a number and is also needed as a factor. The loan origination date is stored as a factor of datetime strings, so I extract the year to study changes over time.
I combine the lower and upper credit score range into an average, because both measure the same variable, and reorder the income range levels so that plots show them in increasing order.
Many variables also contain NAs, so to examine the data with all of the variables present I remove the rows that contain NAs.
# Extract the first 4 characters of LoanOriginationDate, which hold the year
# of the datetime string (e.g. "2014-01-22 00:00:00" gives "2014")
ld$LoanOriginationDate.year <- substr(as.character(ld$LoanOriginationDate), 1, 4)
# Average the lower and upper credit score bounds into a single score
ld$CreditScoreAverage <- (ld$CreditScoreRangeLower + ld$CreditScoreRangeUpper) / 2
# The rating is categorical, so also store it as a factor for box plots and
# for grouping in other visualizations
ld$ProsperRating..factor. <- factor(ld$ProsperRating..numeric.)
# Reorder the IncomeRange levels into increasing order so that plots are sorted
ld$IncomeRange <- factor(
  ld$IncomeRange,
  levels = levels(ld$IncomeRange)[c(8, 1, 2, 4, 5, 6, 3, 7)])
# Keep only rows with no missing values so every variable can be compared
ld.no_na <- ld[complete.cases(ld), ]
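The row removal above can be checked with a quick count of missing values. The following is a minimal sketch, assuming a small made-up data frame named `toy` in place of `ld` (its values are hypothetical); it uses only base R functions.

```r
# Hypothetical stand-in for ld, with a few missing values
toy <- data.frame(
  BorrowerRate      = c(0.158, 0.092, 0.275, 0.097, 0.209),
  EstimatedReturn   = c(NA, 0.055, NA, 0.060, 0.091),
  DebtToIncomeRatio = c(0.17, 0.18, 0.06, NA, 0.26)
)

# Number of missing values in each column
na_per_column <- colSums(is.na(toy))
print(na_per_column)

# Keep only the rows where every column is present
toy.no_na <- toy[complete.cases(toy), ]

# Rows kept versus rows dropped
rows_kept    <- nrow(toy.no_na)
rows_dropped <- nrow(toy) - rows_kept
cat("kept:", rows_kept, "dropped:", rows_dropped, "\n")
```

Counting first shows which variables drive the loss of rows, which matters here because a single sparse column can remove a large share of the loans.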
Next, I looked at histograms of BorrowerRate with three bin sizes (0.5%, 1% and 2.5%) to determine the shape of the distribution.
In addition, I created a box plot.
To get a better understanding, I plotted the mean, median and quartiles on the histogram with the 1% bin size.
BorrowerRate’s summary statistics are:
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 0.0400 0.1349 0.1845 0.1934 0.2524 0.3600
In the previous plot the distribution is close to normal, but it has a spike at about 0.32 that pulls the mean and the median away from the center of the curve. This needs further investigation.
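As an illustration of the bin size comparison, here is a minimal base R sketch. The `rate` vector is simulated and only stands in for the BorrowerRate column. The report's own plotting code is not shown, so the use of `hist` is an assumption.

```r
set.seed(1)
# Hypothetical rates clipped to [0.04, 0.36], standing in for BorrowerRate
rate <- pmin(pmax(rnorm(1000, mean = 0.19, sd = 0.07), 0.04), 0.36)

# Draw one histogram per bin width: 0.5%, 1% and 2.5%
for (width in c(0.005, 0.01, 0.025)) {
  breaks <- seq(0, 0.40, by = width)
  hist(rate, breaks = breaks,
       main = paste("Bin width", width), xlab = "BorrowerRate")
  # Overlay the three quartiles (dashed) and the mean (solid)
  abline(v = quantile(rate, c(0.25, 0.5, 0.75)), lty = 2)
  abline(v = mean(rate), lty = 1)
}
```

Narrow bins make a spike at a single rate easier to spot, while wide bins give a smoother view of the overall shape.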
Next, I present histograms for every variable in the dataset to get a better grasp of their distributions.
Observations: The Prosper rating distribution looks close to normal but without tails, because the ratings only go from 1 to 7.
The exact measures of central tendency are also useful for the analysis. The following summaries show BorrowerRate for each group of customers by rating:
Summary Prosper Rating == 1
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 0.1779 0.3134 0.3177 0.3176 0.3177 0.3600
Summary Prosper Rating == 2
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 0.1518 0.2712 0.2925 0.2929 0.3149 0.3600
Summary Prosper Rating == 3
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 0.1158 0.2287 0.2489 0.2460 0.2624 0.3500
Summary Prosper Rating == 4
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 0.0895 0.1765 0.1914 0.1943 0.2099 0.3500
Summary Prosper Rating == 5
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 0.0693 0.1414 0.1509 0.1546 0.1649 0.2500
Summary Prosper Rating == 6
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 0.0500 0.0999 0.1119 0.1130 0.1239 0.2098
Summary Prosper Rating == 7
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 0.04000 0.06990 0.07790 0.07906 0.08490 0.21000
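Per-rating summaries like the ones above can be produced in a single call. The sketch below assumes a tiny made-up data frame `toy` (hypothetical values) and uses base R `tapply`; it is one possible way to obtain such output, not necessarily the one used for this report.

```r
# Hypothetical stand-in for ld.no_na: six loans across three ratings
toy <- data.frame(
  BorrowerRate = c(0.32, 0.30, 0.19, 0.21, 0.08, 0.07),
  ProsperRating..factor. = factor(c(1, 1, 4, 4, 7, 7))
)

# One summary() per rating level, returned as a named list
by_rating <- tapply(toy$BorrowerRate, toy$ProsperRating..factor., summary)
print(by_rating)

# Median rate per rating, named by rating level
median_by_rating <- tapply(toy$BorrowerRate, toy$ProsperRating..factor., median)
print(median_by_rating)
```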
Summary Credit Score Average
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 609.5 669.5 709.5 708.5 729.5 889.5
Observations: The distribution is close to normal and shifted toward the lower scores. Even though the credit score is a continuous variable, the values fall into bins one point wide, spaced 20 points apart. The range is 280 (889.5 - 609.5).
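The 20-point spacing follows from how the score ranges are reported. A minimal sketch using the first six CreditScoreRangeLower and CreditScoreRangeUpper values from the `str` output at the top of the report (the variable names `lower`, `upper` and `avg` are introduced here for illustration only):

```r
# First six lower and upper credit score bounds shown in the str() output
lower <- c(640, 680, 480, 800, 680, 740)
upper <- c(659, 699, 499, 819, 699, 759)

# Each range is 19 points wide, so the average always ends in 9.5
range_width <- upper - lower
avg <- (lower + upper) / 2
print(avg)

# Gaps between the distinct averages are multiples of 20
gaps <- diff(sort(unique(avg)))
print(gaps)
```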
Table of Loan Origination Date’s proportions:
##
## 2009 2010 2011 2012 2013 2014
## 0.02301533 0.06347590 0.12795750 0.22536973 0.41490775 0.14527380
Observation: 2013 accounts for the largest share of the loans in the dataset (about 41%).
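A table of proportions like this one can be built with `table` and `prop.table`. The following sketch uses a short made-up vector of years (hypothetical values standing in for LoanOriginationDate.year):

```r
# Hypothetical origination years, one entry per loan
year <- c("2012", "2013", "2013", "2014", "2013", "2012", "2011", "2013")

# Counts per year, then each count divided by the total
year_share <- prop.table(table(year))
print(year_share)

# Year with the largest share of loans
top_year <- names(which.max(year_share))
print(top_year)
```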
Observations: In the histogram the Debt to Income Ratio looks close to normal, although the summary below shows a long right tail (the maximum is 10.01). The exact measures of central tendency are:
Summary Debt to Income Ratio
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 0.0000 0.1500 0.2200 0.2588 0.3200 10.0100
Observations: Most loans are for 15000 or less, with peaks at around 4000, 10000 and 15000. Is there any significant variation over the years?
Observations: The bank almost never issues a loan that is expected to bring a negative return, although there are some cases. I expected the bank to aim for a standard percentage return on every loan, but the distribution looks normal, with the majority of the values between 4% and 16%.
The exact measures of central tendency are:
## Min. 1st Qu. Median Mean 3rd Qu. Max. NA's
## -0.183 0.074 0.092 0.096 0.117 0.284 29084
Observations: As suspected, people who are not employed rarely get a loan, and neither do retired, self-employed or part-time borrowers.
Observations: The majority of the loans go to customers who earn between 25000 and 75000, although those who earn more than 75000 also take out a good number of loans. The exact numbers are:
##
## Not employed $0 $1-24,999 $25,000-49,999 $50,000-74,999
## 1 0 3840 22023 24030
## $75,000-99,999 $100,000+ Not displayed
## 13644 14019 0
The dataset is in a wide format: each row contains the information for a single loan, covering both the loan and the borrower characteristics.
The main feature of interest is the Borrower Rate. I want to be able to predict what rate a given customer would get.
To do this, I will take a closer look at the variables related to the borrowers, such as credit score, income range, debt to income ratio, the amount of the loan and the rating Prosper gives to each borrower.
I created a credit score average because it made more sense to compare the data against a single credit score than against an upper and a lower score. I extracted the year from the loan origination date because I wanted to analyze the loans over the years rather than on a more granular time scale. Finally, I created a factor version of the Prosper rating to clearly differentiate the borrower groups in the plots.
I will use ggpairs for the first plots to get an overview of the relationships between the variables of study.
To improve readability, I will plot BorrowerRate against three variables at a time. The first set includes EstimatedReturn, ProsperRating..numeric. and ProsperRating..factor.
ggpairs plot #1 observations:
- EstimatedReturn and Prosper Rating (numeric) have large correlations with Borrower Rate.
- BorrowerRate seems to follow a normal distribution with an extra peak at the higher rates.
- BorrowerRate seems to decrease as Prosper Rating (factor) increases.
- BorrowerRate faceted by Prosper Rating (factor) seems to follow a normal distribution within each rating.
- Prosper Rating (numeric) and Prosper Rating (factor) have normal distributions.
- Estimated Return is lower for better Prosper Ratings (1 being the worst and 7 the best).
Next is the plot of EmploymentStatus, IncomeRange and LoanOriginationDate.year (numeric).
ggpairs plot #2 observations:
- Employment Status does not seem to have a linear relationship with BorrowerRate.
- The majority of loans fall in a single level of Employment Status.
- As Income Range increases, the Borrower Rate summary statistics decrease.
- The first Income Range bin has close to no loans.
- LoanOriginationDate.year has a low correlation with Borrower Rate.
- The majority of the loans were made in one year, with significantly fewer in the others.
To conclude the ggpairs plots, the last set includes CreditScoreAverage, DebtToIncomeRatio and LoanOriginalAmount.
ggpairs plot #3 observations:
- Credit Score Average and Loan Original Amount have medium correlations (0.4-0.6 in absolute value) with Borrower Rate.
- The Loan Original Amount histogram has 5 peaks.
- Credit Score Average has several peaks and seems to follow a normal distribution.
- Debt to Income Ratio has a low correlation with Borrower Rate.
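The correlations read off these overview plots can also be computed directly. The sketch below is a base R counterpart using `cor` and `pairs` on simulated data. The data frame `toy` and its coefficients are hypothetical, chosen only so that the rate falls as the rating rises.

```r
set.seed(42)
n <- 200
# Hypothetical data: rating from 1 to 7, rate falling as the rating rises
rating <- sample(1:7, n, replace = TRUE)
toy <- data.frame(
  BorrowerRate            = 0.36 - 0.04 * rating + rnorm(n, sd = 0.02),
  ProsperRating..numeric. = rating,
  DebtToIncomeRatio       = runif(n, 0, 0.6)
)

# Pairwise Pearson correlations between the numeric columns
cor_matrix <- cor(toy)
print(round(cor_matrix, 2))

# Scatter plot matrix of the same columns
pairs(toy)
```

Note that Pearson correlation only applies to the numeric columns, so factor variables such as EmploymentStatus would have to be examined with box plots instead.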
To better understand these initial observations, I created plots to analyze each variable. First, for Prosper Rating, I created a scatter plot against the main variable of study, Borrower Rate:
Prosper Rating only takes whole numbers. Because of that, and because of overplotting, I created a box plot of the same variables to see the measures of central tendency.
The previous plot demonstrates how the spread and the summary measures decrease as the rating increases. This raises the question: how are the ratings distributed across the Borrower Rate histogram?
The next plot answers it:
We can confirm that there is a clear separation of Borrower Rate by rating. Also, the peak we saw earlier is made up only of customers with rating 1. But what does the distribution look like for each Prosper rating? Is it normal?
As the previous plots show, every Prosper rating group except rating 1 has a close to normal distribution.
From the previous plots, we can see that the Prosper rating is indeed a very influential variable for the borrower rate. In fact, the mean, median, quartiles, minimum and maximum of the borrower rate generally decrease as the rating goes up.
It is also very clear that most of the borrowers with rating 1 get a rate of approximately 0.31. Customers with rating 2 have a greater spread than those with rating 3, and this pattern repeats up to rating 7, where the quartiles are very close to the median.
In addition, the histograms faceted by rating show that all ratings except rating 1 have a close to normal distribution.
Next, I studied the influence of the other variables on the Borrower Rate. For each continuous variable I analyzed its relationship with the Borrower Rate and its histogram. After that, I answered any particular question that arose from those plots.
Observations: There seems to be a negative linear relationship between credit score and Borrower Rate. However, the spread is large, so the confidence intervals of a model would be wide as well.
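To make the point about wide intervals concrete, the following sketch fits a simple linear model with `lm` on simulated data and computes a 95% prediction interval for one new borrower. The data, the coefficients and the score of 700 are all hypothetical. A prediction interval is used here, rather than a confidence interval for the mean, because it describes the range for an individual rate.

```r
set.seed(7)
n <- 300
# Hypothetical scores and rates with a noisy negative relationship
score <- runif(n, 610, 890)
rate  <- 0.60 - 0.0006 * score + rnorm(n, sd = 0.06)
toy   <- data.frame(CreditScoreAverage = score, BorrowerRate = rate)

# Simple linear regression of rate on credit score
fit <- lm(BorrowerRate ~ CreditScoreAverage, data = toy)
print(coef(fit))

# 95% prediction interval for a new borrower with a score of 700
new_borrower <- data.frame(CreditScoreAverage = 700)
pred <- predict(fit, newdata = new_borrower, interval = "prediction")
print(pred)

# Width of the interval: large when the scatter around the line is large
interval_width <- pred[1, "upr"] - pred[1, "lwr"]
print(interval_width)
```

Even with a clearly negative slope, the interval in this toy example spans a large part of the observed rate range, which is the practical meaning of a large spread.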
If both credit score and Prosper rating have a negative relationship with Borrower Rate, how are they related to each other?
The previous plot does not show any clear relationship, since most Credit Score Averages include most of the possible Prosper Ratings.
Next, I analyzed Loan Origination year.
Observations: The rates become slightly lower in the later years.
The next variable of study is the Debt to Income Ratio.
Observations: The correlation between Debt to Income Ratio and Borrower Rate is weak, because there is no pattern close to linear.
We can conclude that the Debt to Income Ratio has an interquartile range of 0.17 (0.32 - 0.15), and there is a surprising maximum of 10.01. However, since the mean (0.2588) stays close to the median, these outliers have little effect and most of the values are relatively close to 0.22.
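The interquartile range can be worked out from the summary above, together with the usual 1.5 x IQR rule for flagging outliers. That rule is a common convention added here for illustration; the report itself does not state which outlier criterion it uses.

```r
# Quartiles and maximum from the Debt to Income Ratio summary above
q1 <- 0.15
q3 <- 0.32
max_ratio <- 10.01

# Interquartile range: spread of the middle half of the data
iqr <- q3 - q1
print(iqr)

# Upper fence of the 1.5 * IQR rule; values above it count as outliers
upper_fence <- q3 + 1.5 * iqr
print(upper_fence)

# The maximum lies far beyond the fence
is_outlier <- max_ratio > upper_fence
print(is_outlier)
```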
Also, since Prosper Rating has the greatest correlation with Borrower Rate, I wonder how Debt to Income Ratio and the rating are related.
Observations: As the rating increases the debt to income ratio decreases. However, the rating 1 group mean and the rating 7 group mean differ by only about 10%, which is surprising because one might expect such different customers to differ more on this variable.
Continuing the analysis, I wanted to see how the debt variable changed over the years.
Observations: The Debt to Income Ratio has stayed fairly similar through the years, with a small increase in the last years.
Next I analyzed the Loan Original Amount. I thought that for higher amounts the bank would probably charge more interest.
Observations: My initial assumption was incorrect, since the lower amounts carry higher interest.
Observations: The loan amounts have increased over time. Recall from a previous plot that 2013 is the year in which the largest number of loans was originated; we now know that the bank also loaned the greatest amounts in that same year.
Continuing the analysis of the dataset, we look at the Estimated Return of the loans.
Observations: As a whole there is no strong correlation. However, there are some interesting lines, and another variable might help separate them into linear relationships.
Nonetheless, we should examine whether these returns vary with the principal factor of influence, the Prosper Rating:
Observations: The rating does influence the estimated return. We know that better ratings have lower Borrower Rates, which is consistent with the bank expecting lower returns for those ratings.
Continuing to the Employment Status variable.
Observations: There is no clear relationship between the variables. The spreads are mostly large, except for not employed, which might be because the bank does not issue loans to most unemployed customers. We could check this with a bar graph of Employment Status counts.
The last variable of the bivariate analysis is the Income Range. First, we look at the distribution for each category.
Observations: As customers earn more they get lower rates. However, we should analyze how the loans are distributed across the Income Range groups. We also know that people who are not employed rarely get a loan, and that the majority of the loans are for people who earn between 25000 and 75000. However, there are over 25000 loans made to people who earn 75000 or more. We know that the majority of the loans are under 15000, so we can conclude that even people with high incomes request low loan amounts.
Because Borrower Rate decreases across the Income Ranges, I wanted to see whether Income Range plays an important role in the Prosper Rating.
Observations: Every rating appears across the income ranges. Even though the mean rating increases with income, there is no clear correlation.
The previous plots show that almost all of the variables have little to no effect on the borrower rate. This includes the date, the amount, the employment status and the debt to income ratio.
The principal factor is the Prosper rating. On the customer side, only income range, loan amount and credit score seem to have some influence on the rate besides it, lowering the rate as conditions become more favorable.
In addition, the estimated return plot is interesting: there seem to be some linear relationships in it, and another variable might help clarify them.
Since the main factor in the borrower rate is clearly the Prosper rating, I will try to discover other relationships between this rating and the other variables by distinguishing each rating in the previous plots.
Observations: As the credit score increases, so does the rating. However, the spreads are large.
Observations: Over the years the rating groups have kept fairly similar borrower rates.
Observations: With Prosper rating included as a factor, we see that even if the Debt to Income Ratio increases significantly, the borrower rate does not change much within groups.
Observations: This plot corroborates that only the higher ratings get high loan amounts. However, customers in every rating group request low loan amounts. One possible reason is that people with high incomes want to improve their credit by having a well-paid, low-amount loan.
Observations: This is one of the most interesting graphs of the analysis, because within each Prosper Rating group there is a linear relationship between how much interest borrowers pay and how much the bank earns. This indicates that the rating is indeed the most influential variable for the rate, but also that for the same group of people Prosper sets higher rates to obtain higher returns. It would be interesting to study these groups in isolation to determine what makes the bank want to earn more or less from very similar financial profiles. In addition, for the rating 1 group there seem to be at least 4 linear relationships, so there is probably another variable that differentiates what the bank expects to earn from them.
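The contrast between a weak overall correlation and strong correlations within each rating can be reproduced on simulated data. In the sketch below, the helper `make_group` and all its numbers are hypothetical. It only illustrates how grouping can reveal linear relationships that the pooled data hides.

```r
set.seed(3)
# Hypothetical loans: within each rating the return rises with the rate,
# but each rating is offset by a different expected loss
make_group <- function(rating, base_rate, loss) {
  rate <- base_rate + runif(50, -0.03, 0.03)
  data.frame(
    rating          = rating,
    BorrowerRate    = rate,
    EstimatedReturn = rate - loss + rnorm(50, sd = 0.002)
  )
}
toy <- rbind(
  make_group(1, 0.31, 0.26),
  make_group(4, 0.19, 0.09),
  make_group(7, 0.08, 0.03)
)

# Correlation over all loans pooled together
pooled_cor <- cor(toy$BorrowerRate, toy$EstimatedReturn)

# Correlation computed separately inside each rating group
group_cor <- sapply(split(toy, toy$rating), function(g) {
  cor(g$BorrowerRate, g$EstimatedReturn)
})

print(pooled_cor)
print(group_cor)
```

Splitting by rating before computing the correlation is the numeric counterpart of coloring the scatter plot by rating.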
Observations: Across the Income Ranges, the pattern of negative correlation between rating and rate is very similar.
Observations: As in the Income Range box plots, we see similar box plots for each Employment Status. However, the vast majority of the loans are for employed customers, so the other box plots are based on little data.
First, we can conclude that LoanOriginationDate, DebtToIncomeRatio, LoanOriginalAmount, IncomeRange and EmploymentStatus do not have any clear relationship with the Borrower Rate.
We can also see that the Prosper rating is influenced by the credit score average. However, the spread is large, with very different credit scores having the same rating.
In addition, the rates for each rating group stayed stable over the years.
Lower ratings tend to have a higher debt to income ratio. However, this is not a rule: there are people with rating 1 and a debt to income ratio close to 0.
Lower ratings tend to get lower loan amounts. As the loan amount increases the amounts tend to be round numbers: there are several loans of 10,000, 15,000, 20,000, 25,000, 30,000 and 35,000.
There are linear relationships between estimated return and borrower rate when we group the data by rating. This suggests that the bank expects a higher return for a higher rate, with a correlation close to +1 within each group. However, even with higher rates, the lower ratings are expected to return very little. Customers with a Prosper rating of 1 in particular tend to create losses. In addition, there seems to be another variable that divides that group, because it contains multiple linear relationships.
In the first plot, the distribution of the borrower rates (interest) for the loans is similar to a normal distribution if rating 1 is omitted, because almost all of the customers with this rating get a rate of about 0.32. The other groups are in order and clearly separated: the best ratings obtain the best rates. This is measured by the correlation, which is -0.953, a very strong relationship.
This plot is interesting because it demonstrates that better credit scores tend to get better borrower rates. However, the relationship is not strong, with a correlation of -0.529. Also, the same credit score average tends to include customers with several different ratings. For example, all the ratings appear in good numbers in the 730 credit score average bin. This surprised me, because I expected credit score to be a very influential variable for both borrower rate and Prosper rating. We can conclude that there are other variables influencing the rating and the borrower rate.
This final plot stands out because, even though I did not find any linear relationships in the other variables, here we see several linear relationships once we separate the rating groups.
This suggests that the bank expects a higher return for a higher rate, a relationship close to +1. However, this holds only for the ratings other than 1; for rating 1, even the higher rates are expected to return less.
Also, the bank seems to set very different rates for people with similar financial characteristics, so there must be another variable that influences how much Prosper wants to earn. Anyone who wants to obtain a lower rate should keep in mind that, even though the rating is the most influential variable, it is not the only one to take into consideration.
Finally, customers with Prosper rating 1 tend to create losses. There also seems to be another variable that divides that group, because multiple linear relationships appear within it.
Data analysis is a process in which you first have to understand each variable and the problem before making any plots, because if you don’t understand the situation you can’t formulate the questions that will lead to the answers you need.
It is very important to clean up the data, because some variables might have classes or formats that produce misleading visualizations and findings. After that, decide on your main feature of interest, that is, what you are trying to predict, and study that variable alone. Only once you understand it can you explore its relationships with the other variables.
Create plots for all the variables and look for key findings that will guide the investigation until you find an explanation for the phenomena discovered.
On a final note, I believe that data is only as valuable as it is real and trustworthy. Even when it comes from a great source, the analyst must be aware of the limitations of the data due to how it was collected, from whom and under what circumstances.
Regarding this dataset, it would be valuable to study the rest of the variables to discover what influences the Prosper rating, because if the rating can be predicted we could predict the loan characteristics better. It would also be interesting to compare the estimated return with the actual return of the loans.